Improve __crystal_once performance #15216
Codegen still declares an |
I'm also working on the codegen part (primarily to improve performance), but I think this shouldn't have any undefined behavior since the pointer is always cast into the enum type before accessing it, and a boolean has to be at least one byte in size. |
A cast doesn't cut it because LLVM is free to assume the unused bits in the flags are poison. |
@BlobCodes Very nice, thank you! 😍

@HertzDevil By poison you mean that the cast may feel safe, but that we'd be assuming some undefined behavior? LLVM doesn't make guarantees that the 7 other bits will be carried over, just that the value for an

@straight-shoota That sounds safer, that'd need a new |
Co-authored-by: Sijawusz Pur Rahnama <[email protected]>
We should keep the existing implementation around for compatibility with older compilers. I'm wondering why this is even a |
The main performance improvement is from the inlined fast-check, which is also possible without using the enum. Removing the
Creating a new |
Sounds like a great idea to separate the two changes into separate PRs. That's in general, and also in particular if we can merge one without any prerequisites. |
Based on the PR by @BlobCodes: crystal-lang#15216

The performance improvement comes from using an i8 instead of an i1 boolean to have 3 states instead of 2, which makes it possible to quickly detect recursive calls without an array, plus inline tricks to optimize the fast and slow paths.

Unlike the PR:
1. Removes the need for a state maintained by the compiler. This keeps the ability for an older compiler to compile a new release of the compiler (or use a newer stdlib) but breaks the ability for a new compiler to compile an older release (or use an older stdlib)!
2. Doesn't use atomics: we still use a mutex that already guarantees the acquire/release memory ordering semantics, and __crystal_once_init is only ever called once in the main thread before any other thread can be started.
Based on the PR by @BlobCodes: crystal-lang#15216

The performance improvement is two-fold:
1. the use of an i8 instead of an i1 boolean to have 3 states instead of 2, which makes it possible to quickly detect recursive calls without an array;
2. inline tricks to optimize the fast and slow paths.

Unlike the PR:
1. Doesn't use atomics: it already uses a mutex that guarantees acquire/release memory ordering semantics, and __crystal_once_init is only ever called in the main thread before any other thread is started.
2. Removes the need for a state maintained by the compiler, yet keeps forward and backward compatibility (both signatures are supported).
Co-authored-by: David Keller <[email protected]>

Based on the PR by @BlobCodes: crystal-lang#15216

The performance improvement is two-fold:
1. the use of an i8 instead of an i1 boolean to have 3 states instead of 2, which makes it possible to quickly detect recursive calls without an array;
2. inline tricks to optimize the fast and slow paths.

Unlike the PR:
1. Doesn't use atomics: it already uses a mutex that guarantees acquire/release memory ordering semantics, and __crystal_once_init is only ever called in the main thread before any other thread is started.
2. Removes the need for a state maintained by the compiler, yet keeps forward and backward compatibility (both signatures are supported).
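To make the "3 states instead of 2" point concrete, here is a minimal single-threaded sketch in C, not the Crystal stdlib code. It only illustrates how a one-byte flag with a third "processing" state can detect a recursive initializer without keeping an array of in-progress variables. All names (once_checked, FLAG_PROCESSING, init_cyclic, and so on) are hypothetical, and returning -1 stands in for whatever error the real implementation reports.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical state names; the flag occupies one byte, like an i8. */
enum { FLAG_UNINITIALIZED = 0, FLAG_PROCESSING = 1, FLAG_INITIALIZED = 2 };

typedef void (*init_fn)(void);

/* Returns 0 when the value is ready, -1 when the initializer was
 * re-entered while it was still running (recursion detected). */
static int once_checked(uint8_t *flag, init_fn init) {
    assert(flag && init);
    if (*flag == FLAG_INITIALIZED) return 0;  /* already done */
    if (*flag == FLAG_PROCESSING) return -1;  /* recursive call */
    *flag = FLAG_PROCESSING;
    init();
    *flag = FLAG_INITIALIZED;
    return 0;
}

/* A well-behaved initializer: runs exactly once. */
static uint8_t plain_flag;
static int plain_runs;
static void init_plain(void) { plain_runs++; }

/* An initializer that (incorrectly) depends on itself. The nested call
 * sees FLAG_PROCESSING and reports the recursion. */
static uint8_t cyclic_flag;
static int cyclic_nested_rc;
static void init_cyclic(void) {
    cyclic_nested_rc = once_checked(&cyclic_flag, init_cyclic);
}
```

Calling once_checked(&cyclic_flag, init_cyclic) leaves cyclic_nested_rc at -1, because the nested call observes the intermediate state. A two-state boolean cannot distinguish "not started" from "in progress", which is why a separate record of in-progress variables is needed in that design.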
Description
This PR significantly improves __crystal_once performance. I originally wrote this code for a custom Crystal stdlib focused on microchips without malloc.

While the previous implementation used an array to keep track of all variables currently being initialized, this one (ab)uses the given flag boolean pointer as an enum with three values to represent the "being initialized" state.

Also, the previous implementation always used a mutex when accessing any const variable in programs compiled with -Dpreview_mt. This implementation instead has a fast path for the (very likely) scenario that the variable was already initialized, which doesn't need a mutex.
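The fast-path idea can be sketched in C as follows. This is an illustration under stated assumptions, not the code from this PR: the names (lazy_once, lazy_setup, ONCE_PROCESSING, read_answer) are hypothetical, the flag is modeled as a C11 atomic byte so that reading it outside the lock is well defined in C, and a recursive pthread mutex stands in for the runtime's mutex.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical state names stored in the one-byte flag. */
enum { ONCE_UNINITIALIZED = 0, ONCE_PROCESSING = 1, ONCE_INITIALIZED = 2 };

typedef _Atomic unsigned char lazy_flag;
typedef void (*lazy_init_fn)(void);

static pthread_mutex_t lazy_mutex;

/* One-time setup, meant to run on the main thread before any other
 * thread exists. The mutex is recursive so that a recursive initializer
 * reaches the state check instead of deadlocking. */
static void lazy_setup(void) {
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    pthread_mutex_init(&lazy_mutex, &attr);
    pthread_mutexattr_destroy(&attr);
}

/* Slow path: only reached while the variable is not yet initialized. */
static int lazy_once_slow(lazy_flag *flag, lazy_init_fn init) {
    int result = 0;
    pthread_mutex_lock(&lazy_mutex);
    unsigned char state = atomic_load_explicit(flag, memory_order_relaxed);
    if (state == ONCE_PROCESSING) {
        /* Only the lock holder can see this state: recursion. */
        result = -1;
    } else if (state == ONCE_UNINITIALIZED) {
        atomic_store_explicit(flag, ONCE_PROCESSING, memory_order_relaxed);
        init();
        atomic_store_explicit(flag, ONCE_INITIALIZED, memory_order_release);
    }
    pthread_mutex_unlock(&lazy_mutex);
    return result;
}

/* Fast path: a single load and compare, no mutex, cheap to inline. */
static inline int lazy_once(lazy_flag *flag, lazy_init_fn init) {
    if (atomic_load_explicit(flag, memory_order_acquire) == ONCE_INITIALIZED)
        return 0;
    return lazy_once_slow(flag, init);
}

/* Example lazily initialized value. */
static lazy_flag answer_flag;
static int answer;
static int answer_init_calls;
static void init_answer(void) { answer = 42; answer_init_calls++; }

static int read_answer(void) {
    int rc = lazy_once(&answer_flag, init_answer);
    assert(rc == 0); /* no recursion expected for this initializer */
    (void)rc;
    return answer;
}
```

After lazy_setup() has run, the first read_answer() call takes the slow path and runs the initializer under the lock. Every later call returns after the single flag load, which is the common case the description refers to.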
Let's talk numbers
Benchmark code and results on my machine can be found here:
https://gitlab.com/-/snippets/4772439